Beginner’s mind (Shoshin)

Ben Whalley, Paul Sharpe, Sonja Heintz, Andy Wills

Beginner’s mind (Shoshin) denotes openness, eagerness and a lack of preconceptions when studying a subject, just as a beginner would have, no matter what level of expertise the student has.

Even black belt martial artists practice basic techniques like blocks and punches every time they train.

This session doesn’t assume any prior knowledge of R, and introduces the basics. For some this will be revision from last year, but we provide additional material for advanced students to test their knowledge and extend familiar skills. Even if you are quite confident with RStudio from Stage 1, please read the worksheet carefully and complete all of the activities in the blue boxes.

General principles

  • Reproducibility and transparency in science (as a motivation for using R)
  • Precision and attention to detail as an important skill.

Using the RStudio interface

These worksheets assume that you are using a web browser to access the RStudio Server at Plymouth University.

NOTE: RStudio works on most web browsers (e.g. Firefox, Safari, Chrome) but does not work that well on the default web browser in Windows 10 (“Edge”). If you’re using Windows, we recommend downloading Firefox and using that. Firefox is free and open source.

When you log in to RStudio, you’ll be greeted with a screen that looks something like the image below.

RStudio on first opening

When you open RStudio for the first time, you can see three parts:

  1. The Console - This is the large rectangle on the left. This is where you tell R what to do, and it’s also where R prints the answers to your questions.

  2. The Environment - This is the rectangle on the top right. This is where R keeps a list of the data it knows about. It’s empty at the moment, because we haven’t given R any data yet.

  3. The Files - This is the rectangle on the bottom right. This is a bit like the File Explorer in Windows, or the Finder on a Mac. It shows you what files and folders R can see.

You should also be able to see that the two rectangles on the right have a number of other “tabs.” These work like tabs on a web browser.

The top rectangle has the tabs “Environment” and “History.” The History tab keeps a record of commands you’ve recently typed into the Console. This can sometimes be useful.

The bottom rectangle has the tabs “Files,” “Plots,” “Packages,” “Help,” and “Viewer.” We’ll cover what these other tabs do later on.

Before you start

Before starting this module, you need to run an R command which makes a folder and downloads the files you will need for each workshop.

  1. Click on the Console pane
  2. Copy-paste the following command into the console

source('https://raw.githubusercontent.com/benwhalley/lifesavR/main/bootstrap.R')

Your console should now look like this:

Press return (enter) to run the command. If your console looks like the image below, then you are ready to start the session.

In each session you will work in a single file, which we will refer to as a workbook. The file you need for this session is called session-1.rmd. If we click on the file it opens the workbook in a tab of a new pane, called the Source pane. The commands you will learn to write in this course are called ‘R code,’ which is shorthand for ‘R source code.’ This pane allows you to write R code and explore your data.

Click on session-1.rmd in the Files pane.

You’re now ready to start the session.

What can R do?

Watch the following short video to see a few things that R can do.

#todo #video of what R can do to provide context for where we’re going and what we’re doing.

RStudio is a user interface to R, which is a computer language that is primarily designed for data analysis and visualisation. R is a text-based language, so you interact with it by typing commands, and then running those commands with R. RStudio makes it easy to run R commands and organise your work. For example, you can do simple arithmetic (2 * 221), generate some random numbers (rnorm(10, 0, 1)), and plot some random numbers (hist(rnorm(100, 0, 1))).
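As a rough sketch, the three examples mentioned above look like this when typed as separate commands. The lines starting with # are comments that we have added for explanation; R ignores them. Because the numbers are random, your output will differ each time you run the last two commands.

```r
# Simple arithmetic: R prints the answer, 442
2 * 221

# Ten random numbers drawn from a normal distribution (mean 0, standard deviation 1)
rnorm(10, 0, 1)

# A histogram of 100 random numbers from the same distribution
hist(rnorm(100, 0, 1))
```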

You should think of R as a robot. The robot is extremely fast, powerful and tireless, but it’s also literal-minded, and won’t think for itself or take the initiative. You need to tell it exactly what to do, by providing very precise instructions.

Working interactively in R Markdown

Click on the lifesavr folder in the Files pane. Notice that some files have the extension .rmd. These are R Markdown files. It is important that any R Markdown file you create has the extension .rmd (or .Rmd), because this is how RStudio knows what they contain.

R Markdown is a way of combining R with narrative text. It allows you to integrate the results of your data analysis into high quality reports, research papers, dissertations or books. Because it’s such a powerful tool, this module provides an early, gentle introduction to R Markdown.

RStudio needs to distinguish R code from narrative text. This is done by putting the code inside some special characters, creating what’s referred to as a chunk. A chunk is opened using the symbols ```{r}, and closed using the symbols ```. This is what a chunk looks like in RStudio (this chunk has been given the optional name life):

A code chunk in the RMarkdown editor

NOTE: The symbols which start and end a chunk are backticks, not single quotes.

On Windows

On a Mac

Running R code within a chunk

Watch the following short video to see how to run code within a chunk.

#todo re-record #video running code in a chunk.

This short video shows you three ways to run R code within a chunk. The first is to run a complete line of code. You can see here that our cursor is on line 12. The cursor can be anywhere on that line. To run the line, press Ctrl + Enter on Windows or Linux, or ⌘ + ↩︎ (Command + Return) on a Mac.

You’ll see some output beneath the chunk that you don’t need to worry about for now, but one of the effects of running this command is to load a dataset about diamonds.

The cursor has been automatically positioned on line 13. Lines 13 to 15 are actually part of the same command. We use the same keys, Ctrl + Enter (or ⌘ + ↩︎ on a Mac), to run these lines, which generate a scatter plot using the diamonds dataset. Don’t worry about how these commands work for now.

The second way to run code is to select only the commands you want to execute. If you select just the word diamonds on line 13 and run that, you will see that it does something different. This prints the contents of the diamonds data. Because the dataset is large, it just prints the first few rows.

Finally, you often want to run all of the code in a chunk. This can be done by pressing the green arrow on the right hand side of the chunk. Another way to run all of the code is to position your cursor anywhere within the chunk and press Ctrl + Shift + Enter (Windows, Linux) or ⌘ + ⇧ + ↩︎ (Command + Shift + Return, Mac).

  1. Locate the first chunk in session-1.rmd
  2. Place your cursor (anywhere) on the line that says library(tidyverse)
  3. Run the commands by pressing Ctrl + Enter (Windows, Linux) or ⌘ + ↩︎ (Mac)

You will see some output appear beneath the chunk. Don’t worry about the details for now, we’ll explain those later.

Now position your cursor on the line that says diamonds and run the commands.

You should see a scatter plot of the diamonds data appear below the chunk:

Congratulations! You have just run your first lines of R. The code to produce the plot consisted of three lines. You can also run part of a line by highlighting the code you want to run:

  1. Select (highlight) the word diamonds
  2. Run the code

This prints the first few lines of the diamonds data:

Example of running highlighted code

Why would you want to run part of a line of code? In these workshops you will combine simple steps into sequences which do a particular job, such as generating a plot. It’s natural, especially when you’re new to R, that the full sequence of commands won’t do exactly what you want first time. Running part of your code allows you to identify the steps which are correct. This allows you to modify subsequent steps until your code produces the required results. Remember this technique as you will be using it extensively in these workshops.

Variables

Using variables lets us store calculations for later. A variable is a name which can be assigned a value using the assignment operator: <-.

Run the two lines in the chunk named life.

The results should look like this:

Results of running life chunk

Line 24 runs the calculation 40 + 2, then assigns the result to the variable meaningoflife. The assignment operator <- looks like an arrow that points to the left. This is a reminder that the results of the calculation on the right hand side will be assigned to the variable on the left hand side. Line 25 displays the value of meaningoflife.
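Based on the description above, the two lines in the life chunk look something like the following. This is a sketch reconstructed from the text, so the exact wording and comments in your workbook may differ.

```r
# Calculate 40 + 2 and assign the result to the variable meaningoflife
meaningoflife <- 40 + 2

# Typing a variable's name on its own displays its value
meaningoflife
```

Notice that the first line prints nothing: assigning a value stores it silently. Only the second line produces output.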

Variables that you create are stored in what’s called the Global Environment. You can see them in the Environment pane.

Variables that you create are stored in the Global Environment

Inserting a chunk

You insert a new chunk by positioning your cursor on the line where you want the chunk to appear, and selecting the Code > Insert Chunk menu option:

Insert a new chunk

There are also keyboard shortcuts for inserting a chunk:

Windows, Linux: Ctrl + Alt + I

Mac: ⌘ + ⌥ + I (Command + Option + I)

Exercise 1

  1. Find the instructions for Exercise 1 in your workbook
  2. Create a new chunk below the instructions
  3. Inside the chunk, write a line of code which adds together the numbers 9, 4, 55 and 2, and assigns the result to a variable named sum.
  4. Run the line of code you have written

After completing these steps, your environment should look like this:

Environment after Exercise 1

Loading packages

A powerful feature of R is that it can be extended to analyse or plot data in any way imaginable. A package (sometimes called a library) is an extension to R that adds new commands. Packages are loaded using the library() command.

The first command you ran above was library(tidyverse). This loaded the commands needed to create the scatter plot, and also the diamonds data. The tidyverse package is so fundamental to this course that library(tidyverse) is likely to be the first line of R in the first chunk of each of your R Markdown files.

If you’ve understood what packages are then it should be clear that you can’t use the commands provided by tidyverse (and the additional packages it loads) until you’ve run the command library(tidyverse).

For example, if you tried to produce the scatter plot before loading tidyverse you’d see an error like this in the console:

Error in diamonds %>% ggplot(aes(carat, price, colour = clarity)) : 
  could not find function "%>%"

We mention this here, as could not find function errors are one of the most common problems that beginners encounter. They normally mean that you have

  1. forgotten to include library(tidyverse) as the first line in your code, or
  2. forgotten to run that line.
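To make the ordering concrete, here is a minimal sketch of a chunk that avoids the error. The plotting code is taken from the error message shown above, and we assume the tidyverse package is already installed, as it is on the RStudio Server.

```r
# Load tidyverse first: this makes %>%, ggplot() and the diamonds data available
library(tidyverse)

# This only works once the library() line above has been run
diamonds %>%
  ggplot(aes(carat, price, colour = clarity)) +
  geom_point()
```

If you restart RStudio or start a new session, you need to run the library(tidyverse) line again before the plot will work.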

Built-in datasets

A dataset is a set of data relating to a particular topic. Most datasets we will be working with consist of rows and columns, just like a spreadsheet. In R this type of data is stored in a special type of variable called a data.frame. You will also see references to datasets as tibbles. A tibble is just a special type of data.frame, so you can treat the two types of variable as being equivalent.

R has a number of built-in datasets, and more can be loaded from packages.

#todo #video of built-in datasets.

One data.frame that is built in to R is called mtcars. This is a dataset about cars that was published in a US magazine called Motor Trend. Let’s display this data using a new chunk. As we did with the diamonds tibble, if we type mtcars, select the variable name and execute it, we can see the data it contains.

By default this displays only the first ten rows and columns of the data. You can see other rows using the Next, Previous and number buttons below the data. You can see additional columns using the arrow next to the final, right-hand column.
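A short sketch of displaying a built-in dataset is shown below. The functions nrow() and ncol() are not covered elsewhere in this worksheet; we include them here only as a quick way to check how big the dataset is.

```r
# mtcars is built in to R, so no library() call is needed
mtcars

# Count the rows (one per car) and the columns (one per variable)
nrow(mtcars)
ncol(mtcars)
```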

You already know that the diamonds dataset was loaded using library(tidyverse). The gapminder package includes a tibble that contains data about life expectancy, GDP per capita, and population by country. We can load and explore this dataset in the same way we loaded the diamonds dataset. We load the gapminder package, type the name of the tibble (also gapminder) and run it. Again, we can use the navigation buttons to explore the data.

Try this out in your workbook:

  1. Create a new chunk at the bottom of your worksheet
  2. Display the mtcars data.frame and try out the navigation buttons
  3. Load the gapminder package, display the gapminder tibble and explore the data

Exploring and checking data

RStudio has a few different ways to explore datasets.

The glimpse command

#todo #video of glimpse.

Using the glimpse command we can have the columns of a dataset run down the page, and data run across. The command mtcars %>% glimpse() demonstrates this. We’ll explain what the %>% command means shortly. This is like rotating the output you saw earlier anti-clockwise by 90 degrees. The command displays as many observations from the dataset as will fit on a single line.

Each column in a data.frame should be thought of as a variable. Variables have an associated type. When you assigned a number to a variable earlier, behind the scenes, R assigned the variable as numeric. glimpse is useful as the second column in the output shows us the type of each variable. As you can see, all variables in mtcars have the type dbl. This is short for ‘double-precision number’; for now, just know that dbl means a number.

We can see other variable types using glimpse to display the gapminder data.

  • int is short for ‘integer,’ a variable which contains whole numbers (e.g. a participant id number)
  • fct is short for ‘factor,’ a categorical variable (e.g. a specific response to a multiple-choice question)

Other types include:

  • chr is short for ‘character,’ a variable which contains text (e.g. an email address), and
  • ord is short for ‘ordered,’ a variant of fct where the categories have a particular order (e.g. responses like ‘Worst’ < ‘Better’ < ‘Best’)
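The types listed above can be illustrated with a few made-up values. This is only a sketch: the variable names and values are invented for the example, and class() is a base R command (not otherwise used in this worksheet) that reports how R is storing a value.

```r
# An ordinary number is stored as a double-precision number (dbl)
reaction_time <- 431.5
class(reaction_time)      # "numeric"

# The L suffix tells R to store a whole number as an integer (int)
participant_id <- 12L
class(participant_id)     # "integer"

# Text in quotes is stored as character (chr)
email <- "student@example.com"
class(email)              # "character"

# factor() creates a categorical variable (fct)
response <- factor(c("yes", "no", "yes"))
levels(response)          # the categories: "no" "yes"

# An ordered factor (ord) also records the order of its categories
rating <- factor(c("Better", "Worst", "Best"),
                 levels = c("Worst", "Better", "Best"),
                 ordered = TRUE)
rating[1] < rating[3]     # TRUE, because 'Better' comes before 'Best'
```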

We’ll return to why it’s important to know the types of your variables shortly.

Exercise 2

  1. Create a new chunk at the bottom of your worksheet
  2. Use glimpse to display the mtcars, gapminder and diamonds datasets

Now use the output from Exercise 2 to answer the following question. After entering your answer, click outside the box. The border will turn blue when the answer is correct.

  • The clarity variable in the diamonds tibble is of type ____.

The head command and the Environment pane

#todo #video of head.

The head command allows us to print just a few rows from a dataset. The command mtcars %>% head() prints the first six rows of mtcars. We could also assign this output to a new variable.
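Here is a brief sketch of both uses of head. The variable name first_cars is just an example name we have chosen, and we assume tidyverse is installed so that %>% is available.

```r
library(tidyverse)

# Print the first six rows of mtcars
mtcars %>% head()

# head() also accepts the number of rows you want
mtcars %>% head(3)

# Store the first six rows in a new variable instead of printing them
first_cars <- mtcars %>% head()

# Display the stored rows
first_cars
```

After running the assignment, first_cars appears in the Environment pane alongside any other variables you have created.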

You can also explore your variables using the Environment pane. A data.frame will have an icon that looks like a spreadsheet. If you click on the icon, the data.frame is displayed in a new tab in the Source pane.

This tab shows you the same information as printing the data.frame, such as the number of rows and columns, but it also provides tools for exploring the data interactively.

  • The arrows next to the column names allow you to arrange the rows in ascending or descending order based on the column values.
  • The Filter button allows you to specify a value for one or more columns to filter out non-matching rows. For example, we could display just cars with 4 gears. Click the button again to turn off the filter.

Exercise 3

  1. Create a new chunk at the bottom of your worksheet
  2. Use head to assign the first six rows of gapminder to a variable called gapminder_head

The population of Afghanistan in 1967 was ____.

The pipe

The pipe command %>% sends data from one piece of code to another. For example, you’ve already seen that mtcars %>% head() ‘pipes’ mtcars into head, which shows just the first few rows. Think of your data as flowing along a pipe with commands which transform it, step by step, until it plops out the end in the format you want. The pipe command is designed to remind you of this analogy. The % symbols look a bit like two ends of a pipe and the > reminds you of the direction in which your data is flowing.
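One way to see what the pipe does is to compare it with the same command written without it. The sketch below assumes tidyverse is installed; the chained example at the end is our own illustration of data flowing through more than one step.

```r
library(tidyverse)

# These two commands do exactly the same thing
head(mtcars)
mtcars %>% head()

# Pipes can be chained: take mtcars, keep the first rows, then glimpse them
mtcars %>%
  head() %>%
  glimpse()
```

Reading a pipeline from top to bottom tells you what happens to the data, one step at a time.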

Another dataset loaded by the tidyverse package is mpg. This contains fuel economy data from 1999 to 2008 for 38 popular models of US cars.

Exercise 4

  1. Create a new chunk at the bottom of your worksheet
  2. Pipe the mpg data.frame into head and assign the results to a variable called mpg_head

In 1999, a 6 cylinder Audi A4 could cover ____ miles per gallon when driven on the highway.

ggplot: Scatter plots

The other place where you’ve encountered the pipe is in the code you used to make a scatter plot of the diamonds data. You pipe data into ggplot to build up plots step by step.

#todo scatter plots #video.

This chunk creates a scatter plot by piping the mpg data into a ggplot command. The plot is built in two steps. The first step, ggplot(aes(cyl, hwy)) selects variables for the x and y axes. In this case, cyl will be the x-axis and hwy the y-axis. We can see this by running just this part of the command.

In ggplot, each step is separated by + and goes on a new line. Because RStudio knows this is all part of the same ‘pipeline,’ it automatically indents the code.

The second step geom_point() is a command which plots the data as points. If we run the chunk, we see a scatter plot. A point is plotted for each row, using the values for the cyl and hwy variables. The x-axis shows the number of cylinders in the car, and the y-axis shows the miles covered by the car per gallon of fuel when driven on the highway.
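The two steps can also be run separately, which makes it easier to see what each one contributes. In this sketch the variable name axes_only is our own invention, and we assume tidyverse is installed.

```r
library(tidyverse)

# Step 1 on its own: choose the x and y variables and store the result
axes_only <- mpg %>% ggplot(aes(cyl, hwy))

# Displaying it draws the axes, but no data yet
axes_only

# Step 2: add geom_point() to draw one point per row
axes_only + geom_point()
```

Storing an unfinished plot in a variable like this is optional. It is simply a convenient way to check each step before adding the next.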

A problem with this plot is that there are many cars with the same number of cylinders. Another way of saying this is that cyl has been plotted using an ‘integer (whole number) scale.’ Consequently, many of the points are plotted on top of each other, making the values hard to see. The overplotting issue is easily solved by replacing geom_point() with geom_jitter(), which spreads the points out where cars have the same number of cylinders.

Copy the following two chunks into your worksheet and run them.

This chunk creates a scatter plot. The x-axis shows the number of cylinders in the car. The y-axis shows the miles covered by the car per gallon of fuel when driven on the highway.

mpg %>% ggplot(aes(cyl, hwy)) +
  geom_point()

This chunk plots cylinders against highway miles per gallon. The points on the scatter plot are ‘jittered’ to avoid overplotting, making the output easier to read.

mpg %>% ggplot(aes(cyl, hwy)) +
  geom_jitter()

Exercise 5

  1. Create a new chunk at the bottom of your worksheet
  2. Pipe the mpg data.frame into a ggplot command which creates a scatter plot showing hwy (miles per gallon when cars are driven on the highway) against cty (miles per gallon when cars are driven in a city)

Your plot should look like this:

Fuel efficiency is better when cars are driven ____.

ggplot: Boxplots

The following chunk uses the gapminder data to display life expectancy by continent.

gapminder::gapminder %>%
  ggplot(aes(continent, lifeExp)) +
  geom_boxplot()

In a boxplot, the thick line is the median. That thick line is enclosed inside a rectangle (the ‘box’), and the size of the box indicates the inter-quartile range (IQR). The IQR contains the middle 50% of the ordered data. A wider IQR indicates greater variation in a dataset.

The top and bottom of the box are called ‘hinges.’ The vertical lines connected to each hinge are called ‘whiskers,’ and give some indication of the broader range of the data. Exactly what the whiskers show differs depending on the particular command you use to draw a boxplot. In this case, the upper whisker shows the largest data point that is no more than 1.5 times the IQR above the upper hinge. The lower whisker is the lowest point no more than 1.5x the IQR below the lower hinge. In this version of a boxplot, any data point outside the range of the whiskers is described as an ‘outlying point’ and is shown individually as a dot.
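To see how the whisker rule works, the sketch below applies it by hand to a small set of made-up numbers using base R. The values and variable names are invented for illustration, and we assume the hinges are the 25th and 75th percentiles as calculated by quantile(); other boxplot commands may calculate them slightly differently.

```r
# Nine made-up data points, one of which (30) is unusually large
values <- c(2, 4, 5, 6, 7, 8, 9, 10, 30)

# The hinges are the edges of the box; the IQR is the distance between them
lower_hinge <- quantile(values, 0.25)
upper_hinge <- quantile(values, 0.75)
iqr <- IQR(values)

# Whiskers may reach no further than 1.5 times the IQR beyond each hinge
upper_limit <- upper_hinge + 1.5 * iqr
lower_limit <- lower_hinge - 1.5 * iqr

# Each whisker ends at the most extreme data point inside its limit
upper_whisker <- max(values[values <= upper_limit])
lower_whisker <- min(values[values >= lower_limit])

# Anything beyond the limits is drawn as an individual dot
outliers <- values[values > upper_limit | values < lower_limit]

upper_whisker
lower_whisker
outliers
```

Here the box runs from 5 to 9, so the IQR is 4 and the limits are -1 and 15. The whiskers therefore end at 2 and 10, and 30 is shown as an outlying point.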

Scales and types

This boxplot uses mtcars to display car transmission types (automatic or manual) against fuel consumption (miles per gallon).

mtcars %>%
  ggplot(aes(am, mpg)) +
  geom_boxplot()
Warning: Continuous x aesthetic -- did you forget aes(group=...)?

This does not plot the two boxes (automatic and manual) on the x-axis that we expected. Instead it plots a single box, with an x-axis scale ranging from 0–1.

If we check the variable types using glimpse we can see why.

mtcars %>% glimpse()
Rows: 32
Columns: 11
$ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,…
$ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,…
$ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16…
$ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180…
$ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,…
$ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.…
$ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18…
$ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,…
$ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,…
$ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3,…
$ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2,…

The variable am has type dbl. However, it should be a factor with the values 0 and 1 representing automatic and manual transmissions. Because R has been told that it’s a number, it plots the x-axis on a continuous scale between 0–1.

We can use the command factor(am) to tell R that the x-axis is a factor. This gives us the boxplot we were expecting.

mtcars %>%
  ggplot(aes(factor(am), mpg)) +
  geom_boxplot()
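You can check what factor does to a column without making a plot. The sketch below uses base R only. The $ symbol, which picks a single column out of a data.frame, and the class() and levels() commands are not covered elsewhere in this worksheet, so treat this as an optional aside.

```r
# am is stored as a number
class(mtcars$am)

# factor() turns it into a categorical variable with one level per distinct value
am_factor <- factor(mtcars$am)
levels(am_factor)

# Optionally, give the levels readable labels (0 = automatic, 1 = manual)
am_labelled <- factor(mtcars$am, levels = c(0, 1), labels = c("automatic", "manual"))
levels(am_labelled)
```

Labelled levels like these would appear on the x-axis of a boxplot in place of 0 and 1.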

Exercise 6

Use mtcars to make a boxplot showing miles per gallon against number of gears (gear).

Your plot should look like this:

ggplot: Colour

The command used to define which variable is plotted on the x and y axes is aes. This stands for ‘aesthetics,’ and aes is the command which maps variables to visual aspects of a plot. The command ggplot(aes(factor(am), mpg)) converts am to a factor and maps the result to the x-axis, and maps mpg to the y-axis.

You can use the colour option in aes to map colours to a variable. A good use of colour would be to enhance the diamonds scatterplot by representing each diamond’s clarity (an ordinal variable) in colour.

diamonds %>%
  ggplot(aes(carat, price, colour = clarity)) +
  geom_point()

What happens if we use mtcars to plot weight against fuel consumption, using colour to show the number of cylinders in each car?

mtcars %>%
  ggplot(aes(wt, mpg, colour = cyl)) +
  geom_point()

The number of cylinders is plotted in continuous shades of blue. This isn’t terrible, but it would be clearer if each number of cylinders had its own colour. This is the same problem we had with some of the other mtcars variables. They have type dbl, when they are actually factors. We can use factor to fix the colours in the same way we fixed the x-axis:

mtcars %>%
  ggplot(aes(wt, mpg, colour = factor(cyl))) +
  geom_point()

In this plot it’s easier to see that cars with more cylinders tend to be heavier, and have worse fuel consumption.

Exercise 7

Make a scatterplot using mtcars to show weight on the x-axis, fuel consumption on the y-axis, and one colour for each number of gears.

Your plot should look like this:

Introduction to Markdown

When we introduced R Markdown, we said that it’s a way of combining R with narrative text. This section explains ‘markdown,’ a way to format the narrative text and combine it with your R code into a single document.

Watch the following short video to see a few things that can be done using markdown.

Markdown is a language that uses particular characters to style text in the same way you might use menus in a word processor to define headings, font styles, lists etc. You can see some examples of this in your workbook.

  • The # at the start of # lifesaveR: Workbook 1 assigns the text lifesaveR: Workbook 1 as a level 1 heading
  • The sentence below that heading is ordinary text
  • ## An example plot is a level 2 heading
  • The lines beginning 1. under # Exercise 1 create a numbered list, starting at 1

The Knit button combines the Markdown and R chunks, ‘knitting’ them together into an output document. It works through your document converting Markdown to formatted text, and running each of your R chunks, in the order you have written them. We can see that this is the case by ‘knitting’ the R Markdown workbook for this session.

Other markdown features include:

  • *italic text* - italic text
  • **bold text** - bold text

The following simple approach will give you regular practice writing markdown:

  1. Each time you reach a new section in a worksheet, copy the section name (e.g. ‘Exploring and checking data’) into a level 2 heading in your workbook
  2. Before starting each exercise, create a level 3 heading for the exercise number (e.g. ‘Exercise 2’)
  3. Copy the exercise instructions below this as a bullet list
  4. Complete the exercise.

This methodical approach will allow you to complete each workbook step by step. At the end of each session, you will have written a neat, ‘summary’ document which will be a useful revision aid. The same approach works for all R worksheets at the University of Plymouth. If you get into this habit, you will be adding to your ‘reference library’ whenever you complete a worksheet. These summary documents are invaluable when you are more familiar with R but need to quickly remind yourself how to use a particular feature.

Exercise 8

  • For each of the chunks you wrote for Exercises 2-7
    • Add the section name as a level 2 heading
    • Add the exercise number as a level 3 heading
    • Add the instructions as a numbered list (or as plain text if there’s only one instruction)

Check your knowledge

  • What is mtcars?
  • Explain what glimpse does
  • What is the %>% symbol called and what does it do?
  • What is the <- symbol called and what does it do?
  • What is the difference between a dbl and an ord/fct?
  • Give an example of when the difference between dbl and fct matters when making a plot
  • How can you convert a variable from a dbl to a fct?
  • What is the difference between geom_jitter() and geom_point()?
  • Why is geom_jitter useful sometimes?

Extensions

  • Lots more practice plots with different datasets?
  • Better plotting worksheet stage 4?

Further reading